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Abstract 

We present an original integration of high level plan¬ 
ning and execution with incoming perceptual informa¬ 
tion from vision, SLAM, topological map segmentation 
and dialogue. The task of the robot system, implement¬ 
ing the integrated model, is to explore unknown areas 
and report detected objects to an operator, by speaking 
loudly. The knowledge base of the planner maintains 
a graph-based representation of the metric map that is 
dynamically constructed via an unsupervised topolog¬ 
ical segmentation method, and augmented with infor¬ 
mation about the type and position of detected objects, 
within the map, such as cars or containers. According to 
this knowledge the cognitive robot can infer strategies 
in so generating parametric plans that are instantiated 
from the perceptual processes. Finally, a model-based 
approach for the execution and control of the robot sys¬ 
tem is proposed to monitor, concurrently, the low level 
status of the system and the execution of the activities, 
in order to achieve the goal, instructed by the operator. 

Introduction 

In real-world applications where environmental constraints 
are minimal, highly efficient multi-modal perception is a 
prerequisite for action planning and execution (Williams 
and Nayak 1997; Muscettola et al. 2002). The problem to 
be addressed translates into building meaningful, higher- 
level representations of the real-world from incoming raw 
data that are acquired from cameras and laser scanners 
and formulating these representations into a domain where 
reasoning and goal generation takes place (Murphy 2004; 
Maxwell et al. 2004; Carbone et al. 2008). 

A standard set of perception capabilities that need to 
be embodied into a mobile robot regards simultaneous lo¬ 
calization and mapping (SLAM) (Durrant-Whyte and Bai¬ 
ley 2006), and object detection and localization within the 
map (Murphy et al. 2006), (Gould et al. 2008), (Kjellstrom, 
Romero, and Kragic 2011). Furthermore, human-guided op¬ 
eration of the robot can be performed in a natural way using 
dialogue, wherein the robot receives instructions by the hu¬ 
man operator using speech recognition that translates natural 
language into predefined tasks. 

Copyright ©2011, Association for the Advancement of Artificial 
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In this paper, we present a mobile robot system that inte¬ 
grates into the Robot Operating System (ROS) (Quigley et 
al. 2009), multi-modal perception from vision and mapping 
with action planning. With respect to vision, object detec¬ 
tion is performed using several leamable detectors that are 
rapidly updated while 3D localization is estimated via sev¬ 
eral object detections that are used by a greedy algorithm 
based on RANSAC (Hurych, Zimmermann, and Svoboda 
2011). This information is used to augment a graph-based 
representation of the metrical map that is dynamically con¬ 
structed during exploration using an unsupervised topologi¬ 
cal segmentation method (Liu, Colas, and Siegwart 2011). 
The result of this process is a set of nodes that are annotated 
with properties related to the detected objects wherein con¬ 
nections between the nodes determine the traversability be¬ 
tween the corresponding areas. By formulating the perceived 
knowledge about the environment into a suitable logic rep¬ 
resentation that is maintained into the knowledge base (Pirri 
and Reiter 2000; Pirri 2011), the logic-based planner can 
build a set of tasks, whose goal is communicated by the op¬ 
erator, through dialogue. The planner both verifies the con¬ 
sistency of their executability and monitors its execution re¬ 
porting possible failures. 

Topological Segmentation of Metric Map 

The perception of the topology of the environment through 
mapping is initially represented within a metric layer and 
in the following as a higher-level topological layer, used for 
action planning. 

Using the sensors of the robot, we first build a metric map 
using standard SLAM algorithm. 1 From wheel odometry 
and a 180° 2D laser scanner, we obtain an occupancy grid 
using Rao-Blackwellized particle filtering (Grisetti, Stach- 
niss, and Burgard 2007). 

Based on this occupancy grid, we perform a topological 
segmentation of the free space in the metric layer. Instead of 
working at the discretization level of the occupancy grid, we 
down-sample it to zones of size 1 m 2 that will be referred to 
as “nodes” in the rest of this section. 

We use spectral clustering (Brunskill, Kollar, and Roy 
2007) as the segmentation method of the metrical map. The 

1 We use the GMapping software in ROS: 

http://www.ros.org/wiki/gmapping. 



general algorithm of spectral clustering requires the neigh¬ 
borhood graph together with the corresponding adjacency 
matrix W with n x n elements W (z, j) = co^ , where n is 
the number of nodes in the graph. Among the different ap¬ 
proaches that have been considered in the definition of the 
weights COij between nodes, in our work, we define it as 
Ktj) 

COij = e ° 2 where l(ij) is the distance between the cen¬ 
ters of nodes i and j. Following the notation of our previous 
work in (Liu, Colas, and Siegwart 2011), the algorithm of 
spectral clustering is shown in Alg. 1 


Algorithm 1: Spectral Clustering on Topological Segmenta- 
tion_ 

Input: W —)> Adjacency matrix, where W{i,j) indicates 
the weight between two nodes i and j; 
k —* The number of clusters-areas; 

T = ■ - Tk} — {1 j 2,, fe}; 

Output: The list of indices corresponding to the nodes. 

1 Calculate the normalized graph Laplacian using 

L sym := / - D- * l l 2 WD~ x l 2 or L rw -I-D~ l W, where 
D = diagjc/,and dj = £" =] 

2 Calculate the k smallest eigenvectors u \,..., of L 
(either L sym or L rw ), form the matrix 

U = [ Ul ...u k ]eR nxk - 

3 Set U to be U with rows normalized to the unit L 2 norm; 

4 Use k -means clustering on the rows of U ; 

5 Assign label 7] to cluster j if and only if row j of U is 
assigned to cluster j; 


In order to enable the robot to execute plans that remain 
consistent during the discovery of new, previously unex¬ 
plored areas, we need to ensure the consistency of the topo¬ 
logical map segmentation in time. In other words, a new seg¬ 
mentation of the map should build upon the last instance of 
segmented map instead of employing a segmentation of the 
complete map. To enable the spectral clustering method to 
work incrementally, we set up a simple mechanism as shown 
in Alg. 2 for each iteration. 


Algorithm 2: Incremental Segmentation 

Input: M t —>> Decomposition results of the occupancy 
grid map; 

Obs t Threshold determining sufficient new 
observations; 

Output: The updated list of topological regions T . 

1 Retrieve the newly updated free nodes from M t , with a 
total surface area A reap, 
if Area t < Ohs t then 
Return ; 
else 

new_list_of_regions = Do_SpectralClustering(M r ); 
Clean_up(); 

Register(7\ new_list_of.regions); 

end if 



Figure 1: Topological decomposition of a metrical map us¬ 
ing the proposed methodology. 

In Figure 1 we show an example of topologically seg¬ 
mented metrical map based on the described methodology. 

In order to use the topological decomposition of the metri¬ 
cal map for plan generation, we extract the centroid of each 
area and use it to denote the position of a node within the 
graph-based representation of the topological map while the 
edges between the nodes are determined according to the 
proximity of one node with respect to another based on a 
distance threshold. The topological segmentation of the met¬ 
rical map is a continuous process running in parallel with 
object detection and localization that in turn, augments the 
domain knowledge stored in the topological graph with in¬ 
coming information from vision. 

Object Detection and Annotation of 
Topological Graph 

Visual perception of the environment by the mobile robot is 
implemented by object detection and localization using an 
omnidirectional camera. The information about the detected 
objects is then introduced into the graph-based representa¬ 
tion of the topologically segmented map in the form of graph 
nodes around the object that are attached onto the graph and 
by storing information related to the detected object. 

Online learnable object detector We use an online learn- 
able object detector (Hurych, Zimmermann, and Svoboda 
2011) that is based on an efficient combination of (Lep- 
etit, Lagger, and Fua 2005) and (Kalal, Matas, and Miko- 
lajczyk 2010). There are two key ideas we are using: Kalal 
et al. (Kalal, Matas, and Mikolajczyk 2010) showed that a 
tracker allows for efficient boosting of a detector and Lepetit 
et al. (Lepetit, Lagger, and Fua 2005) showed that multiple 
detectors using the same set of features can run almost at the 
same rate as a single detector. In contrast to (Kalal, Matas, 
and Mikolajczyk 2010), we use several rapidly updated de¬ 
tectors instead of a tracker that use the same features but are 
updated with different parameters, yielding similar boosting 
ability as a tracker, while running in real-time. 









Category car detector The above described online learn- 
able detector essentially detects instances of a certain ob¬ 
ject. What is crucial is the visual similarity of the instance to 
the object that has been selected for learning. The individ¬ 
uality of the learnable detector makes general car detection 
hard. Car surface is usually highly specular while changing 
illumination influences its appearance significantly. We ap¬ 
plied Adaboost detector from OpenCV library (Bradski and 
Kaehler 2008) for detection of rear car part. For learning we 
partly used available datasets and partly assembled our own. 


3D localization of detected objects Object detectors lo¬ 
calize objects of interest in captured images. This informa¬ 
tion needs to be transformed into the corresponding position 
in the 3D world so that the robot can infer strategies in ap¬ 
proaching or avoiding objects depending on the task. 

For object localization in 3D, a new ROS component 
has been developed. This component depends on the robot 
pose estimated from odometry and 2D laser scan data, on 
static transforms which relate the omnidirectional (Lady- 
bug3) camera to the robot base and on internal camera cal¬ 
ibration. These transforms are used to resolve directions in 
which the objects were detected within a particular refer¬ 
ence frame. From several detections of a particular object 
class, a number of objects and their poses are estimated us¬ 
ing a greedy algorithm based on RANSAC. In this way, the 
system tries to interpret all these detections following the 
minimum description length principle. An example of car 
detection and localization within the map is shown in Figure 
2 while the localization algorithm is given in Alg. 3. 

After the 3D localization of an object we associate a set of 
nodes around the object that correspond to the areas that the 
robot can reach (left, right, back, front) in order to approach 
the object. In this way, the topological graph representation 
of the metric map is enriched with perceptual information 
coming from vision giving an augmented graph-based repre¬ 
sentation that the planner is using to generate complex plans. 
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Figure 2: Example of car detection in one of the LB 3 cam¬ 
eras and car models placed in 3D map. 


Algorithm 3: 3D localization of objects 
Input: D Set of unprocessed object detections 

consisting of position and direction in common 
reference frame; 

Dmin minimum number of consistent detections for 
object to be localized; i max —>> maximum number of 
iterations; 

Output: Set of detected objects O with resolved 
position and set of consistent detections. 
Initialize set of objects O <— {} 
while \D\ > D min do 

Initialize new object (/?*,D*) with empty consensus 
set D* <- {} 

for i 4— 1 to i max do 

Compute position p from two detections Do 
randomly chosen from D 
Find detections D\ from D consistent with p 
Initialize refinement counter j <— 0 
while \Dj+\ \ > \Dj \ do 
Set j<-j + 1 

Refine position p using all detections Dj 
Find detections Dj+ \ from D consistent with p 
end while 
if \Dj\ > \D*\ then 

Update new object p* <— p, D* <— Dj 
end if 
end for 

if \D* | > Dmin then 

Accept new object d^-dU{(/,D*)} 

Remove processed detections D <— D \D* 
else 

return O 
end if 
end while 


Dialogue 

In this paper, we consider settings in which a human opera¬ 
tor is at a remote location, away from the disaster area into 
which the robot is deployed. The human and the robot inter¬ 
act through an Operator Control Unit (OCU). The OCU is 
illustrated in Figure 3. 

The OCU provides the operator with a visualized map, 
camera stream, as well as plan and dialogue histories. The 
OCU facilitates multi-modal interaction: The operator and 
the robot can use spoken dialogue to interact, and the oper¬ 
ator can use pointing gestures (mouse) to select waypoints 
or landmarks for the robot to go to. A gesture can but need 
not be accompanied by an utterance. Selecting a waypoint, 
or saying ”Go here [select waypoint]” have in principle the 
same interpretation, namely that the operator intends the 
robot to go to the selected location. 

The operator and the robot are working together on a task 
- namely to jointly explore an environment. This makes the 
interaction inherently task-driven. The idea of interpreting 
communicative acts in terms of their underlying intention 
(operator goal) therefore plays a fundamental role in the 
OCU design. It enables us to connect interaction to action in 
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Figure 3: Operator Control Unit, with visualization of map, 
camera stream, and plan- and dialogue histories 

the world, in a way as proposed by the GUI design guide¬ 
lines of (Goodrich and Olsen 2003), and models of situ¬ 
ated dialogue processing like in (Kruijff, Janicek, and Lison 
2010). The interpreted intention is grounded in the “world” 
by resolving the references in the utterances to aspects in 
the robot’s situation awareness that is maintained and up¬ 
dated into the knowledge base of the planner. These refer¬ 
ences can be referring expressions like “the car” as in “go 
to the car,” but can also be deictic references such as “here” 
or selected waypoints or detected objects. The resulting rep¬ 
resentation provides a smooth bridge to planning by stating 
what (abstract) goal the robot is to achieve, and relative to 
what locations. 

The Logic-based Robot Control 

The robot control is endowed with a declarative temporal 
model of the controllable activities and a planning engine. 
The structure of the architecture is shown in Figure 4. 



Figure 4: The robot control architecture. 

The declarative temporal model is specified in the Tem¬ 
poral Flexible Situation Calculus (TFSC) (Finzi and Pirri 
2005) and explicitly represents the main components and 
processes of the robot system, the cause-effect relation¬ 
ships as well as the temporal constraints among the pro¬ 


cesses. The TFSC extends the language of a basic theory 
of actions in the Situation Calculus (Pirri and Reiter 1999; 
Reiter 2001), combining temporal constraint reasoning and 
reasoning about actions. It intermediates between Situa¬ 
tion Calculus formulae and temporal constraints networks 
(Dechter, Meiri, and Pearl 1991). 

Within this framework, the system is modeled as a set 
of components specifying their activities over timelines. In 
the implementation that is presented in this paper, we make 
use of three components specified as slam, navigation and 
vision that trigger a set of processes, according to their role. 
In detail, the slam component performs a toposeg process to 
segment the metric map of the environment, the navigation 
component executes a goto mode (nod eld) process to reach 
a target node nodeld and the vision component performs a 
detect (object) process that detects objects in the acquired 
video stream and localizes them within the map. 

These processes are explicitly represented through 
fluents and instantaneous starting and ending actions 
which are defined in terms of preconditions and ef¬ 
fects. For example, the toposeg process is modeled by 
the fluent process(slam,toposeg, s) and both the actions 
start Joposeg(t) and endJoposeg(t). The effects are defined 
by the following successor state axiom: 

process(slam,toposeg, do(a,s)) = 

3t a = start Joposeg(t)M 
process(slam, toposeg, s) A 
-i 3t' (a = endJoposeg(t')) 

where the action preconditions are: 

Poss(start Joposeg(t), s) = idle(slam,s) A 

time(s) < t 

Poss(endJoposeg(t),s) = process(slam, toposeg, s)/\ 

time(s) < t 

and the hard time constraints among activities are man¬ 
aged by the TFSC model using Allen-like temporal relations 
(Allen 1983). 

The planning engine is composed of two main logical 
modules: the plan generator and the execution monitoring. 
The plan generator relies on a library of Prolog scripts des¬ 
ignating the set of tasks which the mobile robot can per¬ 
form, according to the specified processes, their temporal 
constraints (compatibilities), and preconditions. For exam¬ 
ple: 

• Go here: navigate to a position within the metric map. 

• Move left, right, forward: execute simple motion com¬ 
mands. 

• Visit graph: visit all the nodes of the graph-based repre¬ 
sentation of the metrical map. 

• Visit node: visit a specific node of the graph-based repre¬ 
sentation of the metrical map, following the optimal path 
within the graph according to a selected criterion (e.g. the 
distance between the nodes). 

• Approach object: visit a node (left, right, front or back) 
around a detected object. 









































These tasks are based on the perceived information of the 
environment using vision and slam , which is represented in 
the knowledge of the planning engine in the form of a graph 
(see Fig. 5). The nodes of the graph correspond to topologi¬ 
cally segmented regions or approachable areas around a de¬ 
tected object and edges determine the traversability between 
the regions. 

The execution-monitoring is a continuous process which 
ensures that the set of action sequences, generated by the 
plan generator according to the TFSC model and the cur¬ 
rent state of the domain knowledge, are consistently exe¬ 
cuted. Concurrently, at regular time intervals, the execution¬ 
monitoring reads the system state and monitors the execu¬ 
tion of the activities, in order to detect system malfunction¬ 
ing that may result in action failures (De Giacomo, Reiter, 
and Soutchanski 1998; Pettersson 2005). 

Both the TFSC model and the planning engine are imple¬ 
mented in ECLIPSE Prolog (Apt and Wallace 2006) which 
optimally combines the power of a constraint solver (for the 
time and compatibility constraints) with inference in order 
to generate the set of action sequences, and also enable the 
continuous update due to incoming new knowledge (using 
finite progression). 

The planner embedded into the ROS architecture The 

declarative temporal model and the planning engine are fully 
embedded into ROS. The underlying implementation in¬ 
volves the following ROS nodes: 

1. A ROS Service node (RSN); 

2. A ROS Robot node (RRN). 

The RSN allows for RPC request-reply interactions with 
the user interfaces while the RRN takes care of the commu¬ 
nication between the perceptual and physical robot compo¬ 
nents and the planning engine. 

The RSN manages the communication with the human in¬ 
terfaces and embeds into the ROS language the logical part 
of the control system. It enables the human operator to inter¬ 
act with the control system during the computational cycle. 
In fact, the operator can interact with the system by post¬ 
ing goals, and he/she can also interrupt the execution of a 
plan, generated by the planning engine, or directly control 
the perceptual and physical robot components. 



Figure 5: An instance of topological map segmentation to¬ 
gether with the corresponding node centers and connections 
within the nodes that is used by the robot to generate plans. 


On the other hand the RRN receives the information from 
the different perceptual modalities (mapping and vision) in 
order to build the domain knowledge of the robot. The RRN 
is also responsible for sending task activation signals to the 
robot components in order to perform the sequence of ac¬ 
tions generated by the planner, according to the operator re¬ 
quests. 

More specifically, the RRN takes as input the segmented 
metric map and creates a logical graph-based representation 
of the explored area. Nodes and edges of the graph are speci¬ 
fied by predicates together with additional information about 
the edges (capacity, distance) and the nodes (surface area, 
convexity). According to this representation the plan gener¬ 
ator is able to infer shortest paths along the graph, inducing 
from this a plan, whenever a node-to-node navigation task is 
requested by the operator. 

Similarly, the visual perception provides the RRN with 
information about detected objects. Whenever an object is 
detected and localized within the metric map, a new set of 
nodes is added to the topological graph, circumscribing the 
object, and the corresponding properties are specified in the 
action theory. The orientation of the object can be used to 
determine the correspondence of these nodes with respect 
to the areas at the front, back, left and right of the object. 
This allows the plan generator to make a plan, to suitably 
approach the detected object, and also it enables the model- 
based planner to reason about the augmented logical struc¬ 
ture. 

The RSN receives a task request from the user, creates 
a logical representation of the request and sends the cor¬ 
responding Prolog term to the execution-monitoring. The 
execution-monitoring asks the planner to generate an ex¬ 
ecutable set of action sequences, according to the current 
state of the knowledge base. Once the plan is generated, the 
execution-monitoring sends these actions for execution to 
the RRN. Upon receipt of an action, the RRN takes the con¬ 
trol managing the physical execution of the action. When 
an action is performed, the execution-monitoring retrieves 
the state of the robot, whence it verifies whether the action 
was successful or not. In the former case, the execution¬ 
monitoring sends the next action to the RRN, otherwise it 
aborts the execution of the remaining plan and notifies the 
failure to the RSN. In turn, the RSN returns the result of 
the execution of the action to the interface in so yielding the 
control to the operator. 

Conclusion 

In this paper, we have presented a mobile robot system that 
employs high-level control in order to operate in a real-world 
setting where the main task is human-assisted exploration 
of an environment. In the presented system, we have in¬ 
tegrated multi-modal perception from vision and mapping 
with a model-based executive control. We have also showed 
how the system allows the interaction between the human 
operator and the robot platform via the dialogue-based com¬ 
munication. In this framework, action planning is performed 
using a high-level representation of the environment that 
is obtained through topological segmentation of the metric 
map and object detection and 3D localization in the map. 




This representation has the form of a graph where all the 
information related to the spatial characteristics of the en¬ 
vironment is stored into properties that are annotated to the 
nodes and the edges of the graph that is used by the planner 
to generate task-dependent plans. The control system moni¬ 
tors the execution of the action sequences and communicates 
the status through the dialogue. 
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